NSF PAR Search | NSF Public Access Repository

More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Wang, Kaiwen; Oertell, Owen; Agarwal, Alekh; Kallus, Nathan; Sun, Wen (July 2024, Proceedings of the 41st International Conference on Machine Learning)

In this paper, we prove that Distributional Reinforcement Learning (DistRL), which learns the return distribution, can obtain second-order bounds in both online and offline RL in general settings with function approximation. Second-order bounds are instance-dependent bounds that scale with the variance of return, which we prove are tighter than the previously known small-loss bounds of distributional RL. To the best of our knowledge, our results are the first second-order bounds for low-rank MDPs and for offline RL. When specializing to contextual bandits (one-step RL problem), we show that a distributional learning based optimism algorithm achieves a second-order worst-case regret bound, and a second-order gap dependent bound, simultaneously. We also empirically demonstrate the benefit of DistRL in contextual bandits on real-world datasets. We highlight that our analysis with DistRL is relatively simple, follows the general framework of optimism in the face of uncertainty and does not require weighted regression. Our results suggest that DistRL is a promising framework for obtaining second-order bounds in general RL settings, thus further reinforcing the benefits of DistRL.
Full Text Available
More Benefits of Being Distributional: Second-Order Bounds for Reinforcement Learning

Wang, Kaiwen; Oertell, Owen; Agarwal, Alekh; Kallus, Nathan; Sun, Wen (July 2024, Proceedings of Machine Learning Research)

Full Text Available
The Non-linear $F$-Design and Applications to Interactive Learning

Agarwal, Alekh; Qian, Jian; Rakhlin, Alexander; Zhang, Tong (April 2024, Forty-first International Conference on Machine Learning)

Full Text Available
Provable Benefits of Representational Transfer in Reinforcement Learning

Agarwal, Alekh; Song, Yuda; Sun, Wen; Wang, Kaiwen; Wang, Mengdi; Zhang, Xuezhou (July 2023, The Conference on Learning Theory)

Full Text Available
Adversarially Trained Actor Critic for Offline Reinforcement Learning

Cheng, Ching-An; Xie, Tengyang; Jiang, Nan; Agarwal, Alekh (July 2022, Proceedings of the 39th International Conference on Machine Learning)

We propose Adversarially Trained Actor Critic (ATAC), a new model-free algorithm for offline reinforcement learning (RL) under insufficient data coverage, based on the concept of relative pessimism. ATAC is designed as a two-player Stackelberg game: A policy actor competes against an adversarially trained value critic, who finds data-consistent scenarios where the actor is inferior to the data-collection behavior policy. We prove that, when the actor attains no regret in the two-player game, running ATAC produces a policy that provably 1) outperforms the behavior policy over a wide range of hyperparameters that control the degree of pessimism, and 2) competes with the best policy covered by data with appropriately chosen hyperparameters. Compared with existing works, notably our framework offers both theoretical guarantees for general function approximation and a deep RL implementation scalable to complex environments and large datasets. In the D4RL benchmark, ATAC consistently outperforms state-of-the-art offline RL algorithms on a range of continuous control tasks.
Full Text Available
Towards a Dimension-Free Understanding of Adaptive Linear Control

Perdomo, Juan C; Simchowitz, Max; Agarwal, Alekh; Bartlett, Peter (October 2021, Proceedings of Thirty Fourth Conference on Learning Theory)
Full Text Available
Towards a Dimension-Free Understanding of Adaptive Linear Control

Perdomo, Juan; Simchowitz, Max; Agarwal, Alekh; Bartlett, Peter L. (July 2021, Proceedings of the 34th Conference on Learning Theory (COLT2021))
Full Text Available
Model-Based Reinforcement Learning with a Generative Model is Minimax Optimal

Agarwal, Alekh; Kakade, Sham; Yang, Lin F. (July 2020, Proceedings of Machine Learning Research)

This work considers the sample and computational complexity of obtaining an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP), given only access to a generative model. In this model, the learner accesses the underlying transition model via a sampling oracle that provides a sample of the next state, when given any state-action pair as input. We are interested in a basic and unresolved question in model-based planning: is this naïve “plug-in” approach (where we build the maximum likelihood estimate of the transition model in the MDP from observations and then find an optimal policy in this empirical MDP) non-asymptotically minimax optimal? Our main result answers this question positively. With regard to computation, our result provides a simpler approach towards minimax optimal planning: in comparison to prior model-free results, we show that using any high accuracy, black-box planning oracle in the empirical model suffices to obtain the minimax error rate. The key proof technique uses a leave-one-out analysis, in a novel “absorbing MDP” construction, to decouple the statistical dependency issues that arise in the analysis of model-based planning; this construction may be helpful more generally.
Full Text Available
Taking a Hint: How to Leverage Loss Predictors in Contextual Bandits?

Wei, Chen-Yu; Luo, Haipeng; Agarwal, Alekh (July 2020, Conference on Learning Theory)

Full Text Available
